feat(ethexe/node-loader): add per-batch watchdog, SIGTERM, heartbeat#5393
feat(ethexe/node-loader): add per-batch watchdog, SIGTERM, heartbeat#5393grishasobol merged 3 commits intomasterfrom
Conversation
Changed Files
|
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the robustness and observability of the node loader. By implementing per-batch timeouts, the system can now recover from stalled RPC operations. Additionally, the inclusion of SIGTERM support and periodic progress heartbeats ensures that the loader behaves more predictably in containerized environments and provides better operational feedback. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
- Wrap each batch in a 180s tokio::time::timeout so a hung RPC call drops the worker's batch instead of stalling the pool forever. - Listen for SIGTERM in addition to SIGINT so `docker stop` triggers the same graceful drain + summary as Ctrl+C. - Emit an INFO heartbeat every 10s with completed/failed/in_flight counters so a stalled pool is visible without grepping DEBUG. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d9ccb13 to
1e28fae
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a watchdog timeout for batch execution and a periodic heartbeat logger to monitor pool progress in the node-loader. It also enhances shutdown handling by supporting both SIGINT and SIGTERM signals for graceful termination. Review feedback suggests simplifying the heartbeat interval initialization and refining the placement of signal handling within asynchronous selection blocks.
I am having trouble creating individual review comments. Click here to see my feedback.
ethexe/node-loader/src/batch.rs (372-374)
The heartbeat interval is initialized and then immediately ticked. This is redundant because the interval is already configured to skip missed ticks. The initial tick can be removed to simplify the logic.
let mut heartbeat = tokio::time::interval(HEARTBEAT_INTERVAL);
heartbeat.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
ethexe/node-loader/src/main.rs (447)
The tokio::signal::ctrl_c() call should be placed directly within the tokio::select! branch to ensure it is polled correctly. This avoids potential issues with signal consumption when the future is created outside the select block.
result = tokio::signal::ctrl_c() => Ok("Ctrl+C / SIGINT"),
|
@claude fix review and also fix windows problems https://github.com/gear-tech/gear/actions/runs/25118845247/job/73613584380?pr=5393 |
|
Claude finished @grishasobol's task in 3m 5s —— View job Fixed Windows CI Failure
Fix applied ( Wrapped the async fn wait_for_shutdown_signal() -> std::io::Result<&'static str> {
#[cfg(unix)]
{
use tokio::signal::unix::{SignalKind, signal};
let mut sigterm = signal(SignalKind::terminate())?;
tokio::select! {
result = tokio::signal::ctrl_c() => result.map(|()| "Ctrl+C / SIGINT"),
_ = sigterm.recv() => Ok("SIGTERM"),
}
}
#[cfg(not(unix))]
{
tokio::signal::ctrl_c().await.map(|()| "Ctrl+C / SIGINT")
}
}Gemini review notes:
|
tokio::signal::unix is unavailable on Windows (x86_64-pc-windows-msvc), causing a compilation error. Wrap the SIGTERM branch with #[cfg(unix)] and add a #[cfg(not(unix))] fallback that handles SIGINT only. Co-authored-by: Gregory Sobol <grishasobol@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
tokio::time::timeoutso a hung RPC call drops the worker's batch instead of stalling the pool forever.SIGTERMin addition toSIGINTsodocker stoptriggers the same graceful drain + summary asCtrl+C.completed/failed/in_flightcounters so a stalled pool is visible without grepping DEBUG.